{
  "nbformat": 4,
  "nbformat_minor": 0,
  "metadata": {
    "colab": {
      "name": "PLSA.ipynb",
      "version": "0.3.2",
      "provenance": [],
      "collapsed_sections": []
    },
    "kernelspec": {
      "name": "python3",
      "display_name": "Python 3"
    }
  },
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "0I-Es-jovJzm",
        "colab_type": "text"
      },
      "source": [
        "# 概率潜在语义分析"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "EeHek0KrvNbO",
        "colab_type": "text"
      },
      "source": [
        "概率潜在语义分析（probabilistic latent semantic analysis, PLSA）,也称概率潜在语义索引（probabilistic latent semantic indexing, PLSI）,是一种利用概率生成模型对文本集合进行话题分析的无监督学习方法。\n",
        "\n",
        "模型最大特点是用隐变量表示话题，整个模型表示文本生成话题，话题生成单词，从而得到单词-文本共现数据的过程；假设每个文本由一个话题分布决定，每个话题由一个单词分布决定。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "ZpnNY-eRwjq3",
        "colab_type": "text"
      },
      "source": [
        "### **18.1.2 生成模型**\n",
        "\n",
        "假设有单词集合 $W = $ {$w_{1}, w_{2}, ..., w_{M}$}， 其中M是单词个数；文本（指标）集合$D = $ {$d_{1}, d_{2}, ..., d_{N}$}, 其中N是文本个数；话题集合$Z = $ {$z_{1}, z_{2}, ..., z_{K}$}，其中$K$是预先设定的话题个数。随机变量 $w$ 取值于单词集合；随机变量 $d$ 取值于文本集合，随机变量 $z$ 取值于话题集合。概率分布 $P(d)$、条件概率分布 $P(z|d)$、条件概率分布 $P(w|z)$ 皆属于多项分布，其中 $P(d)$ 表示生成文本 $d$ 的概率，$P(z|d)$ 表示文本 $d$ 生成话题 $z$ 的概率，$P(w|z)$ 表示话题 $z$ 生成单词 $w$ 的概率。\n",
        "\n",
        "   每个文本 $d$ 拥有自己的话题概率分布 $P(z|d)$，每个话题 $z$ 拥有自己的单词概率分布 $P(w|z)$；也就是说**一个文本的内容由其相关话题决定，一个话题的内容由其相关单词决定**。\n",
        "   \n",
        "   生成模型通过以下步骤生成文本·单词共现数据：  \n",
        "   （1）依据概率分布 $P(d)$，从文本（指标）集合中随机选取一个文本 $d$ , 共生成 $N$  个文本；针对每个文本，执行以下操作；  \n",
        "   （2）在文本$d$ 给定条件下，依据条件概率分布 $P(z|d)$, 从话题集合随机选取一个话题 $z$, 共生成 $L$ 个话题，这里 $L$ 是文本长度；  \n",
        "   （3）在话题 $z$ 给定条件下，依据条件概率分布 $P(w|z)$ , 从单词集合中随机选取一个单词 $w$. \n",
        "   \n",
        " 注意这里为叙述方便，假设文本都是等长的，现实中不需要这个假设。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "_YwFFCuCgugI",
        "colab_type": "text"
      },
      "source": [
        "生成模型中， 单词变量 $w$ 与文本变量 $d$ 是观测变量， 话题变量 $z$ 是隐变量， 也就是说模型生成的是单词-话题-文本三元组合 ($w, z ,d$)的集合， 但观测到的单词-文本二元组 （$w, d$）的集合， 观测数据表示为单词-文本矩阵 $T$的形式，矩阵 $T$ 的行表示单词，列表示文本， 元素表示单词-文本对（$w, d$）的出现次数。  \n",
        "\n",
        "从数据的生成过程可以推出，文本-单词共现数据$T$的生成概率为所有单词-文本对($w,d$)的生成概率的乘积：  \n",
        "\n",
        "$P(T) = \\prod_{w,d}P(w,d)^{n(w,d)}$  \n",
        "\n",
        "这里 $n(w,d)$ 表示 ($w,d$)的出现次数，单词-文本对出现的总次数是 $N*L$。 每个单词-文本对（$w,d$）的生成概率由一下公式决定：  \n",
        "\n",
        "$P(w,d) = P(d)P(w|d)$   \n",
        "\n",
        "$= P(d)\\sum_{z}P(w,z|d)$  \n",
        "\n",
        "$=P(d)\\sum_{z}P(z|d)P(w|z)$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "rIUH6dILnmQs",
        "colab_type": "text"
      },
      "source": [
        "### **18.1.3 共现模型**\n",
        "\n",
        "$P(w,d) = \\sum_{z\\in Z}P(z)P(w|z)P(d|z)$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "JSt5kq4LoFJT",
        "colab_type": "text"
      },
      "source": [
        "虽然生成模型与共现模型在概率公式意义上是等价的，但是拥有不同的性质。生成模型刻画文本-单词共现数据生成的过程，共现模型描述文本-单词共现数据拥有的模式。  \n",
        "\n",
        "如果直接定义单词与文本的共现概率 $P(w,d)$, 模型参数的个数是 $O(M*N)$, 其中 $M$ 是单词数， $N$ 是文本数。 概率潜在语义分析的生成模型和共现模型的参数个数是 $O(M*K + N*K)$, 其中 $K$ 是话题数。 现实中 $K<<M$, 所以**概率潜在语义分析通过话题对数据进行了更简洁的表示，减少了学习过程中过拟合的可能性**。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "I_SM4HMBpocd",
        "colab_type": "text"
      },
      "source": [
        "### 算法 18.1 （概率潜在语义模型参数估计的EM算法）"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "HVWrjFktp_UF",
        "colab_type": "text"
      },
      "source": [
        "输入： 设单词集合为 $W = ${$w_{1}, w_{2},..., w_{M}$}, 文本集合为 $D=${$d_{1}, d_{2},..., d_{N}$}, 话题集合为 $Z=${$z_{1}, z_{2},..., z_{K}$}, 共现数据 $\\left \\{ n(w_{i}, d_{j}) \\right \\}, i = 1,2,..., M, j = 1,2,...,N;$  \n",
        "\n",
        "输出： $P(w_{i}|z_{k})$ 和 $P(z_{k}|d_{j})$.\n",
        "\n",
        "1. 设置参数 $P(w_{i}|z_{k})$ 和 $P(z_{k}|d_{j})$ 的初始值。\n",
        "\n",
        "2. 迭代执行以下E步，M步，直到收敛为止。  \n",
        "\n",
        " E步：  \n",
        "    $P(z_{k}|w_{i},d_{j})=\\frac{P(w_{i}|z_{k})P(z_{k}|d_{j})}{\\sum_{k=1}^{K}P(w_{i}|z_{k})P(z_{k}|d_{j})}$  \n",
        "  \n",
        " M步：  \n",
        "    $P(w_{i}|z_{k})=\\frac{\\sum_{j=1}^{N}n(w_{i},d_{j})P(z_{k}|w_{i},d_{j})}{\\sum_{m=1}^{M}\\sum_{j=1}^{N}n(w_{m},d_{j})P(z_{k}|w_{m},d_{j})}$  \n",
        "    \n",
        "    $P(z_{k}|d_{j}) = \\frac{\\sum_{i=1}^{M}n(w_{i},d_{j})P(z_{k}|w_{i},d_{j})}{n(d_{j})}$"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {
        "id": "M1DKiR8_wJox",
        "colab_type": "text"
      },
      "source": [
        "#### 习题 18.3"
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "m7xWyhIou6St",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "import numpy as np"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "D0OLpdWcwOGd",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 215
        },
        "outputId": "021c288f-95c7-4451-ab71-49c568c4598a"
      },
      "source": [
        "X = [[0,0,1,1,0,0,0,0,0], \n",
        "     [0,0,0,0,0,1,0,0,1], \n",
        "     [0,1,0,0,0,0,0,1,0], \n",
        "     [0,0,0,0,0,0,1,0,1], \n",
        "     [1,0,0,0,0,1,0,0,0], \n",
        "     [1,1,1,1,1,1,1,1,1], \n",
        "     [1,0,1,0,0,0,0,0,0], \n",
        "     [0,0,0,0,0,0,1,0,1], \n",
        "     [0,0,0,0,0,2,0,0,1], \n",
        "     [1,0,1,0,0,0,0,1,0], \n",
        "     [0,0,0,1,1,0,0,0,0]]\n",
        "X = np.asarray(X);X"
      ],
      "execution_count": 4,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0, 0, 1, 1, 0, 0, 0, 0, 0],\n",
              "       [0, 0, 0, 0, 0, 1, 0, 0, 1],\n",
              "       [0, 1, 0, 0, 0, 0, 0, 1, 0],\n",
              "       [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
              "       [1, 0, 0, 0, 0, 1, 0, 0, 0],\n",
              "       [1, 1, 1, 1, 1, 1, 1, 1, 1],\n",
              "       [1, 0, 1, 0, 0, 0, 0, 0, 0],\n",
              "       [0, 0, 0, 0, 0, 0, 1, 0, 1],\n",
              "       [0, 0, 0, 0, 0, 2, 0, 0, 1],\n",
              "       [1, 0, 1, 0, 0, 0, 0, 1, 0],\n",
              "       [0, 0, 0, 1, 1, 0, 0, 0, 0]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 4
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "rz5CHJLSxWs0",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 35
        },
        "outputId": "e4a614fc-21ca-42af-df12-6f57657b04ab"
      },
      "source": [
        "X.shape"
      ],
      "execution_count": 3,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "(11, 9)"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 3
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "5LwECryz-Gp5",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 179
        },
        "outputId": "918425d3-7323-4a18-86da-9b78b298a1f3"
      },
      "source": [
        "X = X.T;X"
      ],
      "execution_count": 11,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 0],\n",
              "       [0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0],\n",
              "       [1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0],\n",
              "       [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],\n",
              "       [0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1],\n",
              "       [0, 1, 0, 0, 1, 1, 0, 0, 2, 0, 0],\n",
              "       [0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0],\n",
              "       [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0],\n",
              "       [0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 11
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "zoRNiZHYxZ1j",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "class PLSA:\n",
        "    def __init__(self, K, max_iter):\n",
        "        self.K = K\n",
        "        self.max_iter = max_iter\n",
        "        \n",
        "    def fit(self, X):\n",
        "        n_d, n_w = X.shape\n",
        "        \n",
        "        # P(z|w,d)\n",
        "        p_z_dw = np.zeros((n_d, n_w, self.K))\n",
        "        \n",
        "        # P(z|d)\n",
        "        p_z_d = np.random.rand(n_d, self.K)  \n",
        "        \n",
        "        # P(w|z)\n",
        "        p_w_z = np.random.rand(self.K, n_w) \n",
        "        \n",
        "        \n",
        "        for i_iter in range(self.max_iter):\n",
        "            # E step\n",
        "            for di in range(n_d):\n",
        "                for wi in range(n_w):\n",
        "                    sum_zk = np.zeros((self.K))\n",
        "                    for zi in range(self.K):\n",
        "                        sum_zk[zi] = p_z_d[di, zi] * p_w_z[zi, wi]\n",
        "                    sum1 = np.sum(sum_zk)\n",
        "                    if sum1 == 0:\n",
        "                        sum1 = 1\n",
        "                    for zi in range(self.K):\n",
        "                        p_z_dw[di, wi, zi] = sum_zk[zi] / sum1\n",
        "\n",
        "\n",
        "            # M step\n",
        "\n",
        "            # update P(z|d)\n",
        "            for di in range(n_d):\n",
        "                for zi in range(self.K):\n",
        "                    sum1 = 0.\n",
        "                    sum2 = 0.\n",
        "\n",
        "                    for wi in range(n_w):\n",
        "                        sum1 = sum1 + X[di, wi] * p_z_dw[di, wi, zi]\n",
        "                        sum2 = sum2 + X[di, wi]\n",
        "\n",
        "                    if sum2 == 0:\n",
        "                        sum2 = 1\n",
        "                    p_z_d[di, zi] = sum1 / sum2\n",
        "\n",
        "            # update P(w|z)\n",
        "            for zi in range(self.K):\n",
        "                sum2 = np.zeros((n_w))\n",
        "                for wi in range(n_w):\n",
        "                    for di in range(n_d):\n",
        "                        sum2[wi] = sum2[wi] + X[di, wi] * p_z_dw[di, wi, zi]\n",
        "                sum1 = np.sum(sum2)\n",
        "                if sum1 == 0:\n",
        "                    sum1 = 1\n",
        "                    for wi in range(n_w):\n",
        "                        p_w_z[zi, wi] = sum2[wi] / sum1\n",
        "                    \n",
        "                    \n",
        "        return p_w_z, p_z_d\n",
        "    \n",
        "# https://github.com/lipiji/PG_PLSA/blob/master/plsa.py"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "mujaObW481zI",
        "colab_type": "code",
        "colab": {}
      },
      "source": [
        "model = PLSA(2, 100)\n",
        "p_w_z, p_z_d = model.fit(X)"
      ],
      "execution_count": 0,
      "outputs": []
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "EGrRJHeCCEBw",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 89
        },
        "outputId": "6a0fb7f5-04aa-4695-f0fa-2295696f9943"
      },
      "source": [
        "p_w_z"
      ],
      "execution_count": 25,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[0.25281765, 0.53794268, 0.60544526, 0.85760766, 0.23643854,\n",
              "        0.16258125, 0.77282068, 0.14683982, 0.44248871],\n",
              "       [0.62856568, 0.02317636, 0.91884258, 0.80113768, 0.36035441,\n",
              "        0.36354421, 0.80388194, 0.52256403, 0.5249816 ]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 25
        }
      ]
    },
    {
      "cell_type": "code",
      "metadata": {
        "id": "SbdVC9ICCKW4",
        "colab_type": "code",
        "colab": {
          "base_uri": "https://localhost:8080/",
          "height": 215
        },
        "outputId": "acc5f5c0-52d0-4a63-c321-6d890b4751f9"
      },
      "source": [
        "p_z_d"
      ],
      "execution_count": 26,
      "outputs": [
        {
          "output_type": "execute_result",
          "data": {
            "text/plain": [
              "array([[6.47069416e-09, 9.99999994e-01],\n",
              "       [6.05006497e-21, 1.00000000e+00],\n",
              "       [6.72897510e-01, 3.27102490e-01],\n",
              "       [2.01323848e-05, 9.99979868e-01],\n",
              "       [4.92302340e-38, 1.00000000e+00],\n",
              "       [2.72560935e-01, 7.27439065e-01],\n",
              "       [5.48439452e-29, 1.00000000e+00],\n",
              "       [2.37025537e-06, 9.99997630e-01],\n",
              "       [9.95781362e-25, 1.00000000e+00],\n",
              "       [6.56993414e-37, 1.00000000e+00],\n",
              "       [1.31568004e-07, 9.99999868e-01]])"
            ]
          },
          "metadata": {
            "tags": []
          },
          "execution_count": 26
        }
      ]
    }
  ]
}